Tacotron 2: human-like text-to-speech


Text-to-speech has long been a research subject for many organizations, and Google has now published a new research paper on it. Google already had a system named Tacotron, but the company has updated it with substantial improvements, and since this is the next version of Tacotron it is named "Tacotron 2". Google claims the system achieves near-human accuracy.


The workings of the system were described by Jonathan Shen and Ruoming Pang, software engineers on the Google Brain and Machine Perception teams:

In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.
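To make the dimensions in that description concrete, here is a minimal sketch in Python (using NumPy with placeholder random data rather than a real model) of how an 80-dimensional mel spectrogram with 12.5 ms frames relates to a 24 kHz waveform: each spectrogram frame accounts for 24,000 × 0.0125 = 300 audio samples, which the WaveNet-like vocoder stage must generate. The `fake_text_to_mel` and `fake_vocoder` functions below are hypothetical stand-ins for the two stages, included only to show the shapes involved.

```python
import numpy as np

# Constants taken from the description above.
SAMPLE_RATE = 24_000      # output waveform sample rate (Hz)
FRAME_SHIFT_MS = 12.5     # one spectrogram frame every 12.5 ms
N_MELS = 80               # 80-dimensional mel spectrogram

# Audio samples that each spectrogram frame accounts for: 300.
SAMPLES_PER_FRAME = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)

def fake_text_to_mel(text: str) -> np.ndarray:
    """Placeholder for the sequence-to-sequence stage (letters -> mel frames).
    A real Tacotron 2 model predicts these frames; here we simply emit one
    random frame per character so the array shapes are visible."""
    n_frames = len(text)
    return np.random.randn(n_frames, N_MELS).astype(np.float32)

def fake_vocoder(mel: np.ndarray) -> np.ndarray:
    """Placeholder for the WaveNet-like vocoder (mel frames -> waveform).
    A real vocoder conditions on the mel frames; here we emit noise of the
    correct length: 300 samples per frame at 24 kHz."""
    n_samples = mel.shape[0] * SAMPLES_PER_FRAME
    return np.random.uniform(-1.0, 1.0, size=n_samples).astype(np.float32)

if __name__ == "__main__":
    mel = fake_text_to_mel("Hello world")
    audio = fake_vocoder(mel)
    print(f"mel spectrogram: {mel.shape}")        # (11, 80)
    print(f"waveform: {audio.shape[0]} samples = "
          f"{audio.shape[0] / SAMPLE_RATE * 1000:.1f} ms of audio")
```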

The full description is given in the research paper titled "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions".

Google demonstrated the system with two sample audio clips, which were found to be comparable to professional recordings and showed that the system can read complex and mixed sentences as well.
However, some problems still remain. The system has difficulty pronouncing certain complex words such as "decorum" and "merlot", and in some cases the output contains noise that also needs to be addressed. The voice cannot convey emotion, so a listener cannot tell whether the speaker is sad or happy. Generating audio in real time is another open problem. All of these are interesting challenges for Google to solve.